In this document, we will briefly introduce R and installing all the required packages for this course. A descriptive analysis of the dataset is then operated.

Installing the required packages

The following commands will load the package if they are already installed. If they are not yet installed, they will be installed and loaded afterwards. Note that for Windows users, Rtools is required for some packages (e.g., CASdatasets). This list may not be exhaustive and other packages may be required in other notebooks.

if (!require("xts")) install.packages("xts")
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
if (!require("sp")) install.packages("sp")
## Loading required package: sp
if (!require("CASdatasets")) install.packages("CASdatasets", repos = "http://cas.uqam.ca/pub/", type="source")
## Loading required package: CASdatasets
## Loading required package: survival
if (!require("caret")) install.packages("caret")
## Loading required package: caret
## Loading required package: ggplot2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("mgcv")) install.packages("mgcv")
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
if (!require("dplyr")) install.packages("dplyr")
## Loading required package: dplyr
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
## 
##     collapse
## The following objects are masked from 'package:xts':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
if (!require("gridExtra")) install.packages("gridExtra")
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
if (!require("visreg")) install.packages("visreg")
## Loading required package: visreg
if (!require("MASS")) install.packages("MASS")
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
if (!require("plotrix")) install.packages("plotrix")
## Loading required package: plotrix
if (!require("xtable")) install.packages("xtable")
## Loading required package: xtable
if (!require("scales")) install.packages("scales")
## Loading required package: scales
## 
## Attaching package: 'scales'
## The following object is masked from 'package:plotrix':
## 
##     rescale
if (!require("broom")) install.packages("broom")
## Loading required package: broom
if (!require("stringi")) install.packages("stringi")
## Loading required package: stringi
if (!require("arrow")) install.packages("arrow")
## Loading required package: arrow
## 
## Attaching package: 'arrow'
## The following object is masked from 'package:utils':
## 
##     timestamp
if (!require("patchwork")) install.packages("patchwork")
## Loading required package: patchwork
## 
## Attaching package: 'patchwork'
## The following object is masked from 'package:MASS':
## 
##     area
if (!require("sf")) install.packages("sf")
## Loading required package: sf
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
if (!require("htmlwidgets")) install.packages("htmlwidgets")
## Loading required package: htmlwidgets
if (!require("leaflet")) install.packages("leaflet")
## Loading required package: leaflet
## 
## Attaching package: 'leaflet'
## The following object is masked from 'package:xts':
## 
##     addLegend
require("CASdatasets")
require("ggplot2")
require("mgcv")
require("caret")
require("gridExtra")
require("dplyr")
require("visreg")
require("MASS")
require("plotrix")
require("xtable")
require("scales")
require("broom")
require("stringi")
require("arrow")
require("patchwork")
require("sf")
require("leaflet");

In this jupyter notebook, we will use the following options to set the width and height of our plots.

options(repr.plot.width = 8, repr.plot.height = 4, repr.plot.res = 250)

Getting started with the dataset

Loading the dataset

We will now load a dataset from the CASdatasets package. In case you were not able to install the CASdatasets package, we also provide a parquet file of the dataset (see more on that below).

We can simply load the dataset with the following command:

data("freMTPLfreq")

To keep it simple and illustrative, we will only keep a subset of this dataset. Each line corresponds to a policy. We will restrict ourselves to the policies covering a vehicle aged between 0 and 25 years. Also, we will only keep policies that were covered for a maximum of one year.

We will use the tidyverse universe in this course, as it can be easier to read (and writing clear code is important!). Subsetting can be done with the filter function.

dataset <- freMTPLfreq %>% filter(Exposure <= 1 & Exposure >= 0 & CarAge <= 25)

Note the pipe operator which allows to chain operations. We could also have written the following. We check that we obtain the same result with the all.equal(dataset, dataset_alternative) function. To save some memory we then remove the alternative dataset.

dataset_alternative <- freMTPLfreq %>%
  filter(Exposure <= 1) %>%
  filter(Exposure >= 0) %>%
  filter(CarAge <= 25)

sprintf(
  "Are the two datasets equal ? %s",
  ifelse(all.equal(dataset, dataset_alternative), "Yes", "No")
)
## [1] "Are the two datasets equal ? Yes"
rm(dataset_alternative)
write_parquet(dataset, sink = "../data/dataset.parquet", compression = "gzip")

We will save the dataset into a parquet file, so we don’t need to load the CASdatasets package anymore and filter the data.

For those that could not install this package, now is the time to load the provided parquet file.

dataset <- read_parquet(file = "../data/dataset.parquet")

Checking the dataset

We can check that the dataset is correctly loaded with the following functions.A good idea is to check whether the dataset has been loaded correctly. To do this, the following tools can be used:

  • head allows to visualize the first 6 lines of the dataset.
head(dataset)
## # A tibble: 6 × 10
##   PolicyID ClaimNb Exposure Power CarAge DriverAge Brand    Gas   Region Density
##   <fct>      <int>    <dbl> <fct>  <int>     <int> <fct>    <fct> <fct>    <int>
## 1 1              0     0.09 g          0        46 Japanes… Dies… Aquit…      76
## 2 2              0     0.84 g          0        46 Japanes… Dies… Aquit…      76
## 3 3              0     0.52 f          2        38 Japanes… Regu… Nord-…    3003
## 4 4              0     0.45 f          2        38 Japanes… Regu… Nord-…    3003
## 5 5              0     0.15 g          0        41 Japanes… Dies… Pays-…      60
## 6 6              0     0.75 g          0        41 Japanes… Dies… Pays-…      60
  • str allows to see the format of the different variables. We will typically distinguish numerical variables (real numbers or integers) and factors (categorical data).
str(dataset)
## tibble [410,864 × 10] (S3: tbl_df/tbl/data.frame)
##  $ PolicyID : Factor w/ 413169 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ClaimNb  : int [1:410864] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Exposure : num [1:410864] 0.09 0.84 0.52 0.45 0.15 0.75 0.81 0.05 0.76 0.34 ...
##  $ Power    : Factor w/ 12 levels "d","e","f","g",..: 4 4 3 3 4 4 1 1 1 6 ...
##  $ CarAge   : int [1:410864] 0 0 2 2 0 0 1 0 9 0 ...
##  $ DriverAge: int [1:410864] 46 46 38 38 41 41 27 27 23 44 ...
##  $ Brand    : Factor w/ 7 levels "Fiat","Japanese (except Nissan) or Korean",..: 2 2 2 2 2 2 2 2 1 2 ...
##  $ Gas      : Factor w/ 2 levels "Diesel","Regular": 1 1 2 2 1 1 2 2 2 2 ...
##  $ Region   : Factor w/ 10 levels "Aquitaine","Basse-Normandie",..: 1 1 8 8 9 9 1 1 8 6 ...
##  $ Density  : int [1:410864] 76 76 3003 3003 60 60 695 695 7887 27000 ...
  • summary allows to compute for each variable some summary statistics.
summary(dataset)
##     PolicyID         ClaimNb           Exposure            Power      
##  1      :     1   Min.   :0.00000   Min.   :0.002732   f      :95432  
##  2      :     1   1st Qu.:0.00000   1st Qu.:0.200000   g      :90663  
##  3      :     1   Median :0.00000   Median :0.530000   e      :76784  
##  4      :     1   Mean   :0.03925   Mean   :0.559997   d      :67660  
##  5      :     1   3rd Qu.:0.00000   3rd Qu.:1.000000   h      :26558  
##  6      :     1   Max.   :4.00000   Max.   :1.000000   j      :17978  
##  (Other):410858                                        (Other):35789  
##      CarAge         DriverAge                                   Brand       
##  Min.   : 0.000   Min.   :18.0   Fiat                              : 16653  
##  1st Qu.: 3.000   1st Qu.:34.0   Japanese (except Nissan) or Korean: 79031  
##  Median : 7.000   Median :44.0   Mercedes, Chrysler or BMW         : 19087  
##  Mean   : 7.413   Mean   :45.3   Opel, General Motors or Ford      : 37287  
##  3rd Qu.:12.000   3rd Qu.:54.0   other                             :  9738  
##  Max.   :25.000   Max.   :99.0   Renault, Nissan or Citroen        :216684  
##                                  Volkswagen, Audi, Skoda or Seat   : 32384  
##       Gas                        Region          Density     
##  Diesel :205299   Centre            :159426   Min.   :    2  
##  Regular:205565   Ile-de-France     : 69576   1st Qu.:   67  
##                   Bretagne          : 41986   Median :  288  
##                   Pays-de-la-Loire  : 38541   Mean   : 1987  
##                   Aquitaine         : 31211   3rd Qu.: 1414  
##                   Nord-Pas-de-Calais: 27111   Max.   :27000  
##                   (Other)           : 43013

If one needs some help on a function, typing a question mark and the name of the function in the console opens the help file of the function. For instance,

?head
## starting httpd help server ... done

Descriptive Analysis of the dataset

We will now proceed with a descriptive analysis of this dataset. We will now have a descriptive analysis of the portfolio. The different variables available are

names(dataset)
##  [1] "PolicyID"  "ClaimNb"   "Exposure"  "Power"     "CarAge"    "DriverAge"
##  [7] "Brand"     "Gas"       "Region"    "Density"

PolicyID

The variable PolicyID related to a unique identifier of the policy. We can check that every policy appears only once in the dataset

length(unique(dataset$PolicyID)) == nrow(dataset)
## [1] TRUE

Another possibility is to check the frequency of each PolicyID using the function table. The result is a table that shows for each PolicyID the corresponding number of lines in the dataset. We can then use a second time the function table in this result to show the frequency. We expect to have only ones (with possibily zeros), meaning each PolicyID has a unique line.

table(table(dataset$PolicyID))
## 
##      0      1 
##   2305 410864

To what corresponds the 0 ?

It appears that in this dataset the variable PolicyID is a factor. A factor variable has different levels. It appears that some PolicyID may be missing here (removed from the dataset, probably when we filtered out some policies). It is as if we had a 3-level categorical variable, for instance, color of a car, which takes three possible values: red, blue, gray, but in our dataset, we would only have red and blue cars. Gray would still be a level, but with no observation (i.e. no row) corresponding to a gray car.

To remove unused levels, we can use on the function droplevels.

dataset$PolicyID <- droplevels(dataset$PolicyID)

Exposure in month

The Exposure reveals the fraction of the year during which the policyholder is in the portfolio. We can compute the total exposure by summing the policyholders’ exposures. Here we find:

sprintf("%s years", label_number(accuracy = 0.1)(sum(dataset$Exposure)))
## [1] "230 082.6 years"

We can show the number of months of exposure on a table. The function cut allows to categorize (bin) a numerical variable. We can specify where to ‘break’ and give a name to each level using the labels argument. The output is a factor variable.

table_exposures <- table(cut(dataset$Exposure,
  breaks = seq(from = 0, to = 1, by = 1 / 12),
  labels = 1:12
))
table_exposures
## 
##      1      2      3      4      5      6      7      8      9     10     11 
##  62633  29216  33452  24213  19463  29565  18835  14438  21518  13653  12422 
##     12 
## 131456

Using the function prop.table, it is possible to represent this information in relative terms show the number of months of exposure on a table.

exposures_prop <- prop.table(table_exposures)
round(exposures_prop, 4) * 100
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 15.24  7.11  8.14  5.89  4.74  7.20  4.58  3.51  5.24  3.32  3.02 32.00

Alternatively, we can use a barplot, using ggplot2 !

ggplot(dataset) +
  geom_bar(
    aes(x = cut(Exposure,
      breaks = seq(from = 0, to = 1, by = 1 / 12),
      labels = 1:12
    ))
  ) +
  scale_x_discrete(name = "Number of months") +
  scale_y_continuous(name = "Number of Policies", label = label_number()) +
  ggtitle("Exposure in months")

What if we also want to show the percentage on the bars ?

ggplot(dataset, aes(
  x = cut(Exposure, breaks = seq(from = 0, to = 1, by = 1 / 12), labels = 1:12),
  label = scales::percent(prop.table(after_stat(count)), accuracy = 0.1)
)) +
  geom_bar() +
  geom_text(
    stat = "count",
    vjust = -0.5,
    size = 3
  ) +
  scale_x_discrete(name = "Number of months") +
  scale_y_continuous(
    name = "Number of Policies",
    label = label_number()
  ) +
  ggtitle("Exposure in months")

Note that a barplot is used to plot factor variables (categorical variables). In our case, we categorized the variable Exposure using the function cut. If we do not want to categorize this variable, we should use a histogram. We can specify the number of bins (= 12) or the binwidth (= 1/12).

ggplot(dataset, aes(x = Exposure)) +
  geom_histogram(binwidth = 1 / 12, fill = "gray", color = "white") +
  scale_x_continuous(
    name = "Exposure in fraction of years",
    breaks = seq(0, 1, 1 / 12),
    labels = round(seq(0, 1, 1 / 12), 3)
  ) +
  scale_y_continuous(name = "Number of Polices", labels = label_number()) +
  ggtitle("Exposure in fraction of years")

If you are not familiar with ggplot, I could recommend this cheat-sheet: https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf

Number of claims : ClaimNb

ggplot(dataset, aes(x = ClaimNb)) +
  geom_bar() +
  geom_label(
    stat = "count",
    aes(label = percent(prop.table(after_stat(count)),
      accuracy = 0.01
    )),
    vjust = 0.5
  ) +
  scale_x_continuous(name = "Number of Claims") +
  scale_y_continuous(
    name = "Number of Polices",
    labels = label_number()
  ) +
  ggtitle("Proportion of policies by number of claims")

We can compute the average claim frequency in this portfolio, taking into account the different exposures.

label_percent(accuracy = 0.01)(sum(dataset$ClaimNb) / sum(dataset$Exposure))
## [1] "7.01%"

Let us now look at the other variables.

Power

The variable Power is a categorized variable, related to the power of the car. The levels of the variable are ordered categorically. We can see the different levels of a factor by using the function level in R:

levels(dataset$Power)
##  [1] "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"

We can see the number of observations in each level of the variable, by using the function table.

table(dataset$Power)
## 
##     d     e     f     g     h     i     j     k     l     m     n     o 
## 67660 76784 95432 90663 26558 17398 17978  9270  4593  1758  1276  1494

Remember however, that in insurance, exposures may differ from one policyholder to another. Hence, the table above, does NOT measure the exposure in each level of the variable Power. We can use the functions group_by and summarise from package dplyr to give us the exposure in each level of the variable.

Check out the cheatsheet https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf

power_summary <- dataset %>%
  group_by(Power) %>%
  summarise(
    totalExposure = sum(Exposure),
    Number.Observations = length(Exposure)
  )
power_summary
## # A tibble: 12 × 3
##    Power totalExposure Number.Observations
##    <fct>         <dbl>               <int>
##  1 d            37571.               67660
##  2 e            44436.               76784
##  3 f            55652.               95432
##  4 g            51297.               90663
##  5 h            13800.               26558
##  6 i             9244.               17398
##  7 j             9228.               17978
##  8 k             4493.                9270
##  9 l             2134.                4593
## 10 m              919.                1758
## 11 n              642.                1276
## 12 o              666.                1494

We can show this on a plot as well:

plot_power_expo <- ggplot(power_summary, aes(
  x = Power,
  y = totalExposure,
  fill = Power,
  color = Power,
  label = label_number()(totalExposure)
)) +
  geom_bar(stat = "identity") +
  geom_text(stat = "identity", vjust = -0.5) +
  scale_y_continuous(
    name = "Exposure in years",
    labels = label_number(),
    expand = expansion(mult = c(0, .15))
  ) +
  scale_colour_discrete(guide = "none") +
  scale_fill_discrete(guide = "none")
plot_power_expo

Let us now look at the observed claim frequency in each level

power_summary <- dataset %>%
  group_by(Power) %>%
  summarise(
    totalExposure = sum(Exposure),
    Number.Observations = length(Exposure),
    Number.Claims = sum(ClaimNb),
    Obs.Claim.Frequency = sum(ClaimNb) / sum(Exposure)
  )
power_summary
## # A tibble: 12 × 5
##    Power totalExposure Number.Observations Number.Claims Obs.Claim.Frequency
##    <fct>         <dbl>               <int>         <int>               <dbl>
##  1 d            37571.               67660          2350              0.0625
##  2 e            44436.               76784          3198              0.0720
##  3 f            55652.               95432          3986              0.0716
##  4 g            51297.               90663          3450              0.0673
##  5 h            13800.               26558           998              0.0723
##  6 i             9244.               17398           715              0.0773
##  7 j             9228.               17978           710              0.0769
##  8 k             4493.                9270           375              0.0835
##  9 l             2134.                4593           162              0.0759
## 10 m              919.                1758            74              0.0805
## 11 n              642.                1276            55              0.0857
## 12 o              666.                1494            54              0.0811

We can compute the ratio to the portfolio claim frequency and plot the claim frequencies.

portfolio_cf <- sum(dataset$ClaimNb) / sum(dataset$Exposure)
# Can also be written as
portfolio_cf <- with(dataset, sum(ClaimNb) / sum(Exposure))

plot_power_claimfreq <- ggplot(power_summary, aes(
  x = Power,
  y = Obs.Claim.Frequency,
  color = Obs.Claim.Frequency,
  fill = Obs.Claim.Frequency,
  label = percent(Obs.Claim.Frequency, accuracy = 0.01)
)) +
  geom_bar(stat = "identity") +
  geom_hline(aes(yintercept = portfolio_cf),
    color = "black",
    linewidth = 2,
    linetype = "dashed",
    alpha = 0.33
  ) +
  geom_label(vjust = -0.21, fill = "white", alpha = 0.25) +
  annotate(
    geom = "text",
    x = "m", y = portfolio_cf,
    vjust = -0.5,
    label = paste(
      "Average claim freq. of portfolio: ",
      percent(portfolio_cf, accuracy = 0.01)
    ),
    color = "black"
  ) +
  scale_y_continuous(
    name = "Observed Claim Frequency", labels = label_percent(accuracy = 0.01),
    expand = expansion(mult = c(0, .15))
  ) +
  theme(legend.position = "none")
plot_power_claimfreq

With the package patchwork it is “ridiculously easy” (not my words :-) ) to combine separate ggplots. See below and see https://patchwork.data-imaginist.com/

plot_power_expo / plot_power_claimfreq

CarAge

The vehicle age, in years. This is the first continuous variable that we encounter (although it only takes discrete values).

ggplot(
  dataset,
  aes(x = CarAge)
) +
  geom_bar() +
  scale_x_continuous(name = "Age of the Car", breaks = seq(0, 100, 5)) +
  scale_y_continuous(name = "Number of Polices", labels = label_number())

Alternatively, we can use a histogram, with a binwidth of 1.

ggplot(
  dataset,
  aes(x = CarAge)
) +
  geom_histogram(binwidth = 1, color = "black", fill = "white") +
  scale_x_continuous(name = "Age of the Car", breaks = seq(0, 100, 5)) +
  scale_y_continuous(name = "Number of Polices", labels = label_number())

Again, here, the exposures are not considered on the barplot/histogram. We can use ddply to correct this.

carage_summary <- dataset %>%
  group_by(CarAge) %>%
  summarise(
    totalExposure = sum(Exposure),
    Number.Observations = length(Exposure)
  )
carage_summary
## # A tibble: 26 × 3
##    CarAge totalExposure Number.Observations
##     <int>         <dbl>               <int>
##  1      0         8711.               29984
##  2      1        18138.               37749
##  3      2        17347.               32505
##  4      3        15818.               28652
##  5      4        14966.               25761
##  6      5        14446.               23813
##  7      6        13790.               22433
##  8      7        12909.               20960
##  9      8        13084.               21091
## 10      9        12664.               20708
## # ℹ 16 more rows

Then, we can plot the data onto a barplot, as before.

ggplot(carage_summary, aes(
  x = CarAge,
  y = totalExposure,
  fill = factor(CarAge),
  color = factor(CarAge),
  label = label_number(accuracy = 1)(totalExposure)
)) +
  geom_bar(stat = "identity") +
  geom_text(
    stat = "identity",
    color = "black",
    hjust = 0.25,
    vjust = 0.5,
    angle = 45,
    check_overlap = TRUE
  ) +
  scale_x_continuous(breaks = seq(0, 100, 5)) +
  scale_y_continuous(
    name = "Exposure in years",
    labels = label_number(),
    expand = expansion(add = c(1000, 0), mult = c(0, .15))
  ) +
  theme(legend.position = "none")

We can see a large difference, specially for new cars, which makes sense ! Indeed, let us look at the Exposure for recent vehicles, using a boxplot for instance.

ggplot(
  dataset %>% filter(CarAge < 5),
  aes(x = CarAge, y = Exposure, group = CarAge)
) +
  geom_boxplot() +
  ggtitle("Exposure of recent cars")

Let us now also compute the claim frequencies by age of car and plot them.

carage_summary <- dataset %>%
  group_by(CarAge) %>%
  summarise(
    totalExposure = sum(Exposure),
    Number.Observations = length(Exposure),
    Number.Claims = sum(ClaimNb),
    Obs.Claim.Freq = sum(ClaimNb) / sum(Exposure)
  )

portfolio_cf <- with(dataset, sum(ClaimNb) / sum(Exposure))

ggplot(carage_summary, aes(
  x = CarAge,
  y = Obs.Claim.Freq,
  label = percent(Obs.Claim.Freq, accuracy = 0.01)
)) +
  geom_point() +
  geom_line() +
  geom_hline(
    yintercept = portfolio_cf,
    color = "black", linewidth = 2,
    linetype = "dashed",
    alpha = 0.33
  ) +
  annotate(
    geom = "text",
    x = 20, y = portfolio_cf,
    vjust = -0.5,
    label = paste(
      "Average claim freq. of portfolio: ",
      percent(portfolio_cf, accuracy = 0.01)
    ),
    color = "black"
  ) +
  scale_x_continuous(name = "Age of the Car", breaks = seq(0, 100, 5)) +
  scale_y_continuous(
    name = "Observed Claim Frequency",
    labels = label_percent(accuracy = 0.01)
  ) +
  theme(legend.position = "none")

DriverAge

Similarly to the Age of the Car, we can visualize the Age of the Drivers.

driverage_summary <- dataset %>%
  group_by(DriverAge) %>%
  summarise(
    totalExposure = sum(Exposure),
    Number.Observations = length(Exposure),
    Number.Claims = sum(ClaimNb),
    Obs.Claim.Freq = sum(ClaimNb) / sum(Exposure)
  )
head(driverage_summary, 9)
## # A tibble: 9 × 5
##   DriverAge totalExposure Number.Observations Number.Claims Obs.Claim.Freq
##       <int>         <dbl>               <int>         <int>          <dbl>
## 1        18          148.                 497            46         0.310 
## 2        19          637.                1610           171         0.268 
## 3        20         1021.                2529           216         0.212 
## 4        21         1278.                3040           207         0.162 
## 5        22         1563.                3548           239         0.153 
## 6        23         1830.                4156           230         0.126 
## 7        24         2163.                4862           230         0.106 
## 8        25         2496.                5556           240         0.0962
## 9        26         2937.                6468           312         0.106

We can show the Exposures by Age of the Driver…

ggplot(driverage_summary, aes(x = DriverAge, y = totalExposure)) +
  geom_bar(stat = "identity", width = 0.8) +
  scale_y_continuous(name = "Exposure in years", labels = label_number()) +
  scale_x_continuous(name = "Age of the Driver", breaks = seq(10, 150, 10))

… and the observed claim frequency by Age.

ggplot(driverage_summary, aes(x = DriverAge, y = Obs.Claim.Freq)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(
    name = "Observed Claim Frequency",
    labels = percent,
    breaks = seq(0, 0.50, 0.05)
  ) +
  scale_x_continuous(name = "Age of the Driver", breaks = seq(10, 150, 10))

Brand

The variable Brand is a categorized variable, related to the brand of the car. We can see the different levels of a factor by using the function level in R:

levels(dataset$Brand)
## [1] "Fiat"                               "Japanese (except Nissan) or Korean"
## [3] "Mercedes, Chrysler or BMW"          "Opel, General Motors or Ford"      
## [5] "other"                              "Renault, Nissan or Citroen"        
## [7] "Volkswagen, Audi, Skoda or Seat"
brand_summary <- dataset %>%
  group_by(Brand) %>%
  summarise(
    totalExposure = sum(Exposure),
    Number.Observations = length(Exposure),
    Number.Claims = sum(ClaimNb),
    Obs.Claim.Freq = sum(ClaimNb) / sum(Exposure)
  )

brand_summary
## # A tibble: 7 × 5
##   Brand           totalExposure Number.Observations Number.Claims Obs.Claim.Freq
##   <fct>                   <dbl>               <int>         <int>          <dbl>
## 1 Fiat                    9464.               16653           714         0.0754
## 2 Japanese (exce…        31229.               79031          2078         0.0665
## 3 Mercedes, Chry…        10392.               19087           828         0.0797
## 4 Opel, General …        21734.               37287          1731         0.0796
## 5 other                   5676.                9738           412         0.0726
## 6 Renault, Nissa…       133460.              216684          8905         0.0667
## 7 Volkswagen, Au…        18127.               32384          1459         0.0805
ggplot(brand_summary, aes(
  x = reorder(Brand, totalExposure),
  y = totalExposure,
  fill = Brand,
  label = label_number()(totalExposure)
)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  guides(fill = "none") +
  scale_x_discrete(name = "") +
  scale_y_continuous("Exposure in years",
    labels = label_number(),
    expand = expansion(mult = c(0, 0.10))
  ) +
  geom_label()

Let us now look at the claim frequency by Brand of the car.

ggplot(brand_summary, aes(
  x = reorder(Brand, Obs.Claim.Freq),
  y = Obs.Claim.Freq,
  fill = Brand,
  label = percent(Obs.Claim.Freq, accuracy = 0.1)
)) +
  geom_bar(stat = "identity") +
  geom_label(hjust = +1.2) +
  coord_flip() +
  guides(fill = "none") +
  ggtitle("Observed Claim Frequencies by Brand of the car") +
  scale_x_discrete(name = "Brand") +
  scale_y_continuous(
    "Observed claim Frequency",
    labels = label_percent(accuracy = 0.1)
  )

Gas

The variable Gas is a categorized variable, related to the fuel of the car. We can see the different levels of a factor by using the function level in R:

levels(dataset$Gas)
## [1] "Diesel"  "Regular"
gas_summary <- dataset %>%
  group_by(Gas) %>%
  summarise(
    totalExposure = sum(Exposure),
    Number.Observations = length(Exposure),
    Number.Claims = sum(ClaimNb),
    Obs.Claim.Freq = sum(ClaimNb) / sum(Exposure)
  )
ggplot(
  gas_summary,
  aes(
    x = Gas,
    y = totalExposure,
    fill = Gas,
    label = number(totalExposure)
  )
) +
  geom_bar(stat = "identity") +
  geom_label() +
  guides(fill = "none") +
  scale_x_discrete(name = "Fuel") +
  scale_y_continuous(
    name = "Total Exposure (in years)",
    labels = label_number()
  )

There seems to be a similar amount of Diesel and Regular gas vehicles in the portfolio. It is generally expected that Diesel have a higher claim frequency. Does this also hold on our dataset ?

ggplot(
  gas_summary,
  aes(
    x = Gas, y = Obs.Claim.Freq,
    fill = Gas,
    label = percent(Obs.Claim.Freq, accuracy=0.01)
  )
) +
  geom_bar(stat = "identity") +
  geom_label() +
  guides(fill = "none") +
  scale_x_discrete(name = "Fuel") +
  scale_y_continuous("Observed claim Frequency", labels = label_percent())

Region

The variable Region is a categorized variable, related to the region of the place of residence. We can see the different levels of a factor by using the function level in R:

levels(dataset$Region)
##  [1] "Aquitaine"          "Basse-Normandie"    "Bretagne"          
##  [4] "Centre"             "Haute-Normandie"    "Ile-de-France"     
##  [7] "Limousin"           "Nord-Pas-de-Calais" "Pays-de-la-Loire"  
## [10] "Poitou-Charentes"

What are the Exposures in each region ? What are the observed claim frequencies ?

region_summary <- dataset %>%
  group_by(Region) %>%
  summarize(
    totalExposure = sum(Exposure),
    Number.Observations = length(Exposure),
    Number.Claims = sum(ClaimNb),
    Obs.Claim.Freq = sum(ClaimNb) / sum(Exposure)
  )
region_summary
## # A tibble: 10 × 5
##    Region         totalExposure Number.Observations Number.Claims Obs.Claim.Freq
##    <fct>                  <dbl>               <int>         <int>          <dbl>
##  1 Aquitaine             14223.               31211          1052         0.0740
##  2 Basse-Normand…         6622.               10848           451         0.0681
##  3 Bretagne              27657.               41986          1867         0.0675
##  4 Centre               101843.              159426          6460         0.0634
##  5 Haute-Normand…         3147.                8726           219         0.0696
##  6 Ile-de-France         30017.               69576          2575         0.0858
##  7 Limousin               2376.                4539           196         0.0825
##  8 Nord-Pas-de-C…        11347.               27111           939         0.0828
##  9 Pays-de-la-Lo…        21792.               38541          1569         0.0720
## 10 Poitou-Charen…        11059.               18900           799         0.0722

Creating Maps

We can plot a map with the observed claim frequencies and the total Exposure. We first need to obtain the shape files (which contain the borders of each administrative area.)

  1. Download shapefile from http://www.diva-gis.org/gData
  2. Extract all the files from the zip files, in a directory called shapefiles in your working directory
# From http://www.diva-gis.org/gData
area <- sf::st_read(
  "shapefiles/FRA_adm1.shp",
  options = "ENCODING=UTF8"
)
## options:        ENCODING=UTF8 
## Reading layer `FRA_adm1' from data source 
##   `C:\Users\barigou\Dropbox\UCLouvain\Cours\LACTU2150 - Analyse statistique des données en assurance - GLM - GAM\UCLouvain-LACTU2150\1. Introduction\shapefiles\FRA_adm1.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 22 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -5.143751 ymin: 41.33375 xmax: 9.560416 ymax: 51.0894
## Geodetic CRS:  WGS 84
leaflet(area) %>%
  addPolygons(
    color = "#444444", weight = 1,
    opacity = 1.0, fillOpacity = 0.5,
    highlightOptions = highlightOptions(
      color = "white", weight = 2,
      bringToFront = TRUE
    )
  ) %>%
  addTiles()

We are now going to include our data into the map

area_w_data <- area %>% full_join(region_summary, by = c("NAME_1" = "Region"))
colors <- colorNumeric("YlOrRd", area_w_data$totalExposure)
# Create leaflet map
leaflet(area_w_data) %>%
  addPolygons(
    color = "#444444", weight = 1, smoothFactor = 0.5,
    opacity = 1.0, fillOpacity = 0.5,
    fillColor = ~ colors(totalExposure),
    highlightOptions = highlightOptions(
      color = "white", weight = 2,
      bringToFront = TRUE,
    ),
    popup = ~ paste(
      "Region: ", NAME_1, "<br>",
      "Exposure: ", round(totalExposure, 1)
    )
  ) %>%
  addTiles() %>%
  leaflet::addLegend(
    position = "bottomright",
    pal = colors,
    values = area_w_data$totalExposure,
    title = "Total Exposure",
    labFormat = labelFormat(suffix = ""),
    opacity = 1
  )

Finally, plot the claim frequencies

colors <- colorNumeric("YlOrRd", area_w_data$Obs.Claim.Freq)
# Create leaflet map
leaflet(area_w_data) %>%
  addPolygons(
    color = "#444444", weight = 1, smoothFactor = 0.5,
    opacity = 1.0, fillOpacity = 0.5,
    fillColor = ~ colors(Obs.Claim.Freq),
    highlightOptions = highlightOptions(
      color = "white", weight = 2,
      bringToFront = TRUE,
    ),
    popup = ~ paste(
      "Region: ", NAME_1, "<br>",
      "Exposure: ", percent(Obs.Claim.Freq, accuracy=0.01)
    )
  ) %>%
  addTiles() %>%
  leaflet::addLegend(
    position = "bottomright",
    pal = colors,
    values = area_w_data$Obs.Claim.Freq,
    title = "Obs. Claim Frequency",
    labFormat = labelFormat(suffix = ""),
    opacity = 1
  )

Density

The Density represents here the density of the population at the place of residence. Let us take a look at the densities in the dataset.

summary(dataset$Density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2      67     288    1987    1414   27000
ggplot(dataset, aes(Density)) +
  geom_histogram(bins = 200)

Here, contrary to the age of the driver, or the age of the car, the density has lots of different values, as we can see below.

length(unique(dataset$Density))
## [1] 1270

Let us still compute as before the summary statistics and plot them …

density_summary <- dataset %>%
  group_by(Density) %>%
  summarise(
    totalExposure = sum(Exposure),
    Number.Observations = length(Exposure),
    Number.Claims = sum(ClaimNb),
    Obs.Claim.Freq = sum(ClaimNb) / sum(Exposure)
  )
ggplot(density_summary, aes(x = Density, y = Obs.Claim.Freq)) +
  geom_point()

… but realize it is impossible to see a trend. One way out is to categorize the variable. We will see later (GAM) that it is possible to estimate a smooth function, which avoid the arbitrary categorization.

We can categorize the variable using the function cut.

dataset$DensityCAT <- cut(dataset$Density,
  breaks = quantile(dataset$Density, probs = seq(from = 0, to = 1, by = 0.1)),
  include.lowest = TRUE
)
levels(dataset$DensityCAT) <- LETTERS[1:10]
table(dataset$DensityCAT)
## 
##     A     B     C     D     E     F     G     H     I     J 
## 41494 41173 41330 40432 41028 41408 40889 41171 42408 39531

Then, we can apply the same strategy as above.

density_summary <- dataset %>%
  group_by(DensityCAT) %>%
  summarise(
    totalExposure = sum(Exposure),
    Number.Observations = length(Exposure),
    Number.Claims = sum(ClaimNb),
    Obs.Claim.Freq = sum(ClaimNb) / sum(Exposure)
  )

ggplot(
  density_summary,
  aes(
    x = DensityCAT,
    y = Obs.Claim.Freq,
    fill = DensityCAT,
    label = label_percent()(Obs.Claim.Freq)
  )
) +
  geom_bar(stat = "identity") +
  geom_label() +
  guides(fill = "none") +
  scale_x_discrete(name = "Density") +
  scale_y_continuous(
    "Observed claim Frequency",
    labels = label_percent(),
    expand = expansion(mult = c(0, 0.15))
  )

Interactions

We can of course also dive into some interactions. For instance, we could analyse the effect of the car Age combined with the Fuel (Gas).

Fuel and Car Age

carage_fuel_summary <- dataset %>%
  group_by(CarAge, Gas) %>%
  summarise(
    totalExposure = sum(Exposure),
    Number.Observations = length(Exposure),
    Number.Claims = sum(ClaimNb),
    Obs.Claim.Freq = sum(ClaimNb) / sum(Exposure)
  )
## `summarise()` has grouped output by 'CarAge'. You can override using the
## `.groups` argument.
ggplot(carage_fuel_summary, aes(
  x = CarAge,
  y = Obs.Claim.Freq
)) +
  facet_wrap(~Gas) +
  geom_bar(stat = "identity") +
  scale_x_continuous(name = "Age of the Car", breaks = seq(0, 100, 5)) +
  scale_y_continuous(
    name = "Observed Claim Frequency",
    labels = percent_format(accuracy = 0.01)
  ) +
  theme(legend.position = "none")

# Warning Message -  Explanation here: https://stackoverflow.com/a/69392954

Fuel and Driver Age

We will illustrate another way to show this kind of data, by overlapping both bars.

driverage_fuel_summary <- dataset %>%
  group_by(DriverAge, Gas) %>%
  summarize(Obs.Claim.Freq = sum(ClaimNb) / sum(Exposure))
## `summarise()` has grouped output by 'DriverAge'. You can override using the
## `.groups` argument.
ggplot(data = driverage_fuel_summary, aes(
  x = DriverAge,
  y = Obs.Claim.Freq,
  fill = Gas,
  color = Gas,
  alpha = Gas
)) +
  geom_bar(stat = "identity", position = "identity") +
  scale_x_continuous(name = "Age of the Driver", breaks = seq(0, 100, 5)) +
  scale_y_continuous(
    name = "Observed Claim Frequency",
    labels = label_percent()
  ) +
  scale_colour_manual(values = c("lightblue4", "red")) +
  scale_fill_manual(values = c("lightblue", "pink")) +
  scale_alpha_manual(values = c(.3, .8)) +
  theme_bw()